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Abstract 

This paper describes a new approach, based on linear programming, for computing nonneg- 
ative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization 
where the most salient features in the data are used to express the remaining features. More 
precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X w CX 
and some linear constraints. The constraints are chosen to ensure that the matrix C selects 
features; these features can then be used to find a low-rank NMF of X. A theoretical analysis 
demonstrates that this approach has guarantees similar to those of the recent NMF algorithm 
of Arora ct al. (2012). In contrast with this earlier work, the proposed method extends to more 
general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and 
real datasets provide evidence that the new approach is also superior in practice. An optimized 
C++ implementation can factor a multigigabyte matrix in a matter of minutes. 

Keywords. Nonnegative Matrix Factorization, Linear Programming, Stochastic gradient de- 
scent, Machine learning, Parallel computing, Multicore. 

1 Introduction 

Nonnegative matrix factorization (NMF) is a popular approach for selecting features in data [15— 
17,22]. Many machine-learning and data-mining software packages (including Matlab [3], R [11], 
and Oracle Data Mining [1]) now include heuristic computational methods for NMF. Nevertheless, 
we still have limited theoretical understanding of when these heuristics are correct. 

The difficulty in developing rigorous methods for NMF stems from the fact that the problem 
is computationally challenging. Indeed, Vavasis has shown that NMF is NP-Hard [26]; see [4] for 
further worst-case hardness results. As a consequence, we must instate additional assumptions on 
the data if we hope to compute nonnegative matrix factorizations in practice. 

In this spirit, Arora, Ge, Kannan, and Moitra (AGKM) have exhibited a polynomial-time 
algorithm for NMF that is provably correct — provided that the data is drawn from an appropriate 
model, based on ideas from [7]. The AGKM result describes one circumstance where we can 
be sure that NMF algorithms are capable of producing meaningful answers. This work has the 
potential to make an impact in machine learning because proper feature selection is an important 
preprocessing step for many other techniques. Even so, the actual impact is damped by the fact that 
the AGKM algorithm is too computationally expensive for large-scale problems and is not tolerant 
to departures from the modeling assumptions. Thus, for NMF, there remains a gap between the 
theoretical exercise and the actual practice of machine learning. 
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The present work presents a scalable, robust algorithm that can successfully solve the NMF 
problem under appropriate hypotheses. Our first contribution is a new formulation of the nonneg- 
ative feature selection problem that only requires the solution of a single linear program. Second, 
we provide a theoretical analysis of this algorithm. This argument shows that our method suc- 
ceeds under the same modeling assumptions as the AGKM algorithm with an additional margin 
constraint that is common in machine learning. We prove that if there exists a unique, well-defined 
model, then we can recover this model accurately; our error bound improves substantially on the 
error bound for the AGKM algorithm in the high SNR regime. One may argue that NMF only 
"makes sense" (i.e., is well posed) when a unique solution exists, and so we believe our result has 
independent interest. Furthermore, our algorithm can be adapted for a wide class of noise models. 

In addition to these theoretical contributions, our work also includes a major algorithmic and 
experimental component. Our formulation of NMF allows us to exploit methods from operations 
research and database systems to design solvers that scale to extremely large datasets. We develop 
an efficient stochastic gradient descent (SGD) algorithm that is (at least) two orders of magnitude 
faster than the approach of AGKM when both are implemented in Matlab. We describe a parallel 
implementation of our SGD algorithm that can robustly factor matrices with 10 5 features and 10 6 
examples in a few minutes on a multicore workstation. 

Our formulation of NMF uses a data-driven modeling approach to simplify the factorization 
problem. More precisely, we search for a small collection of rows from the data matrix that can be 
used to express the other rows. This type of approach appears in a number of other factorization 
problems, including rank-revealing QR [14], interpolative decomposition [19], subspace clustering [9, 
23], dictionary learning [10], and others. Our computational techniques can be adapted to address 
large-scale instances of these problems as well. 

2 Separable Nonnegative Matrix Factorizations and Hott Topics 

Notation. For a matrix M and indices i and j, we write for the ith row of M and M.j for 
the jth column of M. We write Mij for the (i,j) entry. 

Let Y be a nonnegative / x n data matrix with columns indexing examples and rows indexing 
features. Exact NMF seeks a factorization Y = FW where the feature matrix F is / x r, where 
the weight matrix W is r x n, and both factors are nonnegative. Typically, r <C min{/, n}. 

Unless stated otherwise, we assume that each row of the data matrix Y is normalized so it sums 
to one. Under this hypothesis, we may also assume that each row of F and of W also sums to 
one [4]. 

It is notoriously difficult to solve the NMF problem. Vavasis showed that it is NP-complete to 
decide whether a matrix admits a rank-r nonnegative factorization [26]. AGKM proved that an 
exact NMF algorithm can be used to solve 3-SAT in subexponential time [4]. 

The literature contains some mathematical analysis of NMF that can be used to motivate algo- 
rithmic development. Thomas [24] developed a necessary and sufficient condition for the existence 
of a rank-r NMF. More recently, Donoho and Stodden [7] obtained a related sufficient condition for 
uniqueness. AGKM exhibited an algorithm that can produce a nonnegative matrix factorization 
under a weaker sufficient condition. To state their results, we need a definition. 

Definition 2.1 A set of vectors {v±, . . . ,v r } C M. d is simplicial if no vector Vi lies in the convex 
hull of {vj : j ^ i}. The set of vectors is a-robust simplicial if for each i, the t\ distance from Vi 
to the convex hull of {vj : j ^ i} is at least a. Figure 1 illustrates these concepts. 
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Algorithm 1: AGKM: Approximably Separable 
Nonnegative Matrix Factorization [4] 

1: Initialize R = $. 

2: Compute the / x / matrix D with Dij = 

ll-^Q- — Xj. ||i. 
3: for k = 1, . . . / do 

4: Find the set A/& of rows that are at least 

be /a + 2e away from X^.. 
5: Compute the distance 8}. of X^. from 

conv({X,-. : j € A/" fc }). 
6: if Jfc > 2e, add A; to the set R. 
7: end for 

8: Cluster the rows in R as follows: j and k 
are in the same cluster if Dj^ < We/ a + 6e. 

9: Choose one element from each cluster to 
yield W. 

10: F = argmin ZeB ,/xr ||X - 1 




Figure 1: Numbered circles are hott topics. Their 
convex hull (orange) contains the other topics (small 
circles), so the data admits a separable NMF. The 
arrow d\ marks the l\ distance from hott topic (1) 
to the convex hull of the other two hott topics; defi- 
nitions of di and c?3 are similar. The hott topics are 
a-robustly simplicial when each dj > a. 



These ideas support the uniqueness results of Donoho and Stodden and the AGKM algorithm. 
Indeed, we can find an NMF of Y efficiently if Y contains a set of r rows that is simplicial and 
whose convex hull contains the remaining rows. 

Definition 2.2 An NMF Y = FW is called separable if the rows of W are simplicial and there 
is a permutation matrix II such that 
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To compute a separable factorization of Y, we must first identify a simplicial set of rows from 
Y. Afterward, we compute weights that express the remaining rows as convex combinations of this 
distinguished set. We call the simplicial rows hott and the corresponding features hott topics. 

This model allows us to express all the features for a particular instance if we know the values 
of the instance at the simplicial rows. This assumption can be justified in a variety of applications. 
For example, in text, knowledge of a few keywords may be sufficient to reconstruct counts of the 
other words in a document. In vision, localized features can be used to predict gestures. In audio 
data, a few bins of the spectrogram may allow us to reconstruct the remaining bins. 

While a nonnegative matrix one encounters in practice might not admit a separable factoriza- 
tion, it may be well- approximated by a nonnnegative matrix with separable factorization. AGKM 
derived an algorithm for nonnegative matrix factorization of a matrix that is well- approximated by 
a separable factorization. To state their result, we introduce a norm on/xn matrices: 

n 

max | Ajj | . 



lo °. 1 i<i<f 



3=1 



Theorem 2.3 (AGKM [4]) Let e and a be nonnegative constants satisfying e < 2Q " 13a . Let X 
be a nonnegative data matrix. Assume X = Y + A where Y is a nonnegative matrix whose rows 
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have unit t\ norm, where Y = FW is a rank-r separable factorization in which the rows of W 
are a-robust simplicial, and where || Ajj < e. Then Algorithm 1 finds a rank-r nonnegative 

factorization FW that satisfies the error bound ||X — FW"^ 1 < We/a + 7e. 

In particular, the AGKM algorithm computes the factorization exactly when e = 0. Although 
this method is guaranteed to run in polynomial time, it has many undesirable features. First, the 
algorithm requires a priori knowledge of the parameters a and e. It may be possible to calculate 
e, but we can only estimate a if we know which rows are hott. Second, the algorithm computes 
all l\ distances between rows at a cost of 0(f 2 n). Third, for every row in the matrix, we must 
determine its distance to the convex hull of the rows that lie at a sufficient distance; this step 
requires us to solve a linear program for each row of the matrix at a cost of Q(fn). Finally, this 
method is intimately linked to the choice of the error norm 1 1 • 1 1 1 . It is not obvious how to adapt 
the algorithm for other noise models. We present a new approach, based on linear programming, 
that overcomes these drawbacks. 



3 Main Theoretical Results: NMF by Linear Programming 

This paper shows that we can factor an approximately separable nonnegative matrix by solving a 
linear program. A major advantage of this formulation is that it scales to very large data sets. 

Here is the key observation: Suppose that Y is any / x n nonnegative matrix that admits a 
rank-r separable factorization Y = FW. If we pad F with zeros to form an / x / matrix, we have 
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M 



ny =: cy 



We call the matrix C factorization localizing. Note that any factorization localizing matrix C is 
an element of the polyhedral set 

$(Y) := {C > : CY = Y, Tr(C) = r, Cjj < 1 Vj, Qj < C j3 Vi,j}. 

Thus, to find an exact NMF of Y, it suffices to find a feasible element of C G &(Y) whose 
diagonal is integral. This task can be accomplished by linear programming. Once we have such a 
C, we construct W by extracting the rows of X that correspond to the indices i where Cu = 1. 
We construct the feature matrix F by extracting the nonzero columns of C. This approach is 
summarized in Algorithm 2. In turn, we can prove the following result. 

Theorem 3.1 Suppose Y is a nonnegative matrix with a rank-r separable factorization Y = FW. 
Then Algorithm 2 constructs a rank-r nonnegative matrix factorization ofY. 

As the theorem suggests, we can isolate the rows of Y that yield a simplicial factorization by 
solving a single linear program. The factor F can be found by extracting columns of C. 



3.1 Robustness to Noise 

Suppose we observe a nonnegative matrix X whose rows sum to one. Assume that X = Y + A 
where Y is a nonnegative matrix whose rows sum to one, which has a rank-r separable factorization 
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Algorithm 2 Separable Nonnegative Matrix Factorization by Linear Programming 

Require: An / X n nonnegative matrix Y with a rank-r separable NMF. 

Ensure: An / x r matrix F and r x n matrix W with F > 0, W > 0, and Y = FW. 

1: Find the unique C G &{Y) to minimize p T diag(C) where p is any vector with distinct values. 

2: Let I = {i : C u = 1} and set W = Yj. and F = Cj. 



Algorithm 3 Approximably Separable Nonnegative Matrix Factorization by Linear Programming 

Require: An / x n nonnegative matrix X that satisfies the hypotheses of Theorem 3.2. 

Ensure: An f x r matrix F and r x n matrix W with F > 0, W > 0, and II X — FW\\ , < 2e. 

J — ' — ' II lloo,l — 

1: Find C € <I>2e(^0 that minimizes p T diagC where p is any vector with distinct values. 
2: Let I = {i : Cu = 1} and set W = X/.. 
3: Set F = argmin ZeR / X r \\X - ZW^ 1 



y = FW such that the rows of W are a-robust simplicial, and where ||A|| 1 < e. Define the 
polyhedral set 

* T (X) := [C > : WCX-XW^Kt, Tr(C) = r, Cjj < 1 Vj, C l3 < Cjj 

The set $(X) consists of matrices C that approximately locate a factorization of X. We can prove 
the following result. 

Theorem 3.2 Suppose that X satisfies the assumptions stated in the previous paragraph. Further- 
more, assume that for every rowYj t . that is not hott, we have the margin constraint \\Yj t . — Y^^. \\ > do 
for all hott rows i. Then we can find a nonnegative factorization satisfying \\X — FWW^ 1 < 2e 

provided that e < ""gj"^"'" - ■ Furthermore, this factorization correctly identifies the hott topics 
appearing in the separable factorization ofY. 

Algorithm 3 requires the solution of two linear programs. The first minimizes a cost vector over 
3>2e(X). This lets us find W . Afterward, the matrix F can be found by setting 

F = argmin \\X - ZW\\ (2) 

Our robustness result requires a margin-type constraint assuming that the original configuration 
consists either of duplicate hott topics, or topics that are reasonably far away from the hott topics. 
On the other hand, under such a margin constraint, we can construct a considerably better approx- 
imation than that guaranteed by the AGKM algorithm. Moreover, unlike AGKM, our algorithm 
does not need to know the parameter a. 

The proofs of Theorems 3.1 and 3.2 can be found in the appendix. The main idea is to show 
that we can only represent a hott topic efficiently using the hott topic itself. Some earlier versions 
of this paper contained incomplete arguments, which we have remedied. For a signifcantly stronger 
robustness analysis of Algorithm 3, see the recent paper [12]. 

Having established these theoretical guarantees, it now remains to develop an algorithm to solve 
the LP. Off-the-shelf LP solvers may suffice for moderate-size problems, but for large-scale matrix 
factorization problems, their running time is prohibitive, as we show in Section 5. In Section 4, we 
turn to describe how to solve Algorithm 3 efficiently for large data sets. 
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3.2 Related Work 



Localizing factorizations via column or row subset selection is a popular alternative to direct fac- 
torization methods such as the SVD. Interpolative decomposition such as Rank- Revealing QR [14] 
and CUR [19] have favorable efficiency properties as compared to factorizations (such as SVD) that 
are not based on exemplars. Factorization localization has been used in subspace clustering and 
has been shown to be robust to outliers [9,23]. 

In recent work on dictionary learning, Esser et al. and Elhamifar et al. have proposed a factoriza- 
tion localization solution to nonnegative matrix factorization using group sparsity techniques [8,10]. 
Esser et al. prove asymptotic exact recovery in a restricted noise model, but this result requires 
preprocessing to remove duplicate or near-duplicate rows. Elhamifar shows exact representative 
recovery in the noiseless setting assuming no hott topics are duplicated. Our work here improves 
upon this work in several aspects, enabling finite sample error bounds, the elimination of any need 
to preprocess the data, and algorithmic implementations that scale to very large data sets. 

4 Incremental Gradient Algorithms for NMF 

The rudiments of our fast implementation rely on two standard optimization techniques: dual de- 
composition and incremental gradient descent. Both techniques are described in depth in Chapters 
3.4 and 7.8 of Bertsekas and Tstisklis [5]. 

We aim to minimize p T diag(C) subject to C G & T (X). To proceed, form the Lagrangian 



with multipliers f3 and w > 0. Note that we do not dualize out all of the constraints. The remaining 
ones appear in the constraint set $o = {C : C > 0, diag(C) < 1, and Cjj < Cjj for all i,j}. 

Dual subgradient ascent solves this problem by alternating between minimizing the Lagrangian 
over the constraint set <f>o, and then taking a subgradient step with respect to the dual variables 



where C* is the minimizer of the Lagrangian over $o- The update of Wi makes very little difference 
in the solution quality, so we typically only update (3. 

We minimize the Lagrangian using projected incremental gradient descent. Note that we can 
rewrite the Lagrangian as 



Here, supp(a:) is the set indexing the entries where x is nonzero, and \ij is the number of nonzeros 
in row j divided by n. The incremental gradient method chooses one of the n summands at random 
and follows its subgradient. We then project the iterate onto the constraint set <I>o- The projection 
onto $o can be performed in the time required to sort the individual columns of C plus a linear-time 
operation. The full procedure is described in Appendix B. In the case where we expect a unique 
solution, we can drop the constraint dj < Cjj, resulting in a simple clipping procedure: set all 



C(C, (3, w) = p T diag(C) + /3(Tr(C) - r) + £ w t (||^. - [CI],. \\ ± - r) 



i=l 



w i <-w i + 8(\\X i .-[C*X] i .\\ 1 -T) and /3 <- (3 + s(Tr(C*) - r) 
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Algorithm 4 Hottopixx: Approximate Separable NMF by Incremental Gradient Descent 

Require: An / X n nonnegative matrix X. Primal and dual stepsizes s p and s^. 

Ensure: An f x r matrix F and r x n matrix W with F > 0, W > 0, and II X — FW\\ , < 2e. 

J — ' — ' II lloo,l — 

1: Pick a cost p with distinct entries. 

2: Initialize C = 0, f3 = 

3: for t = 1,.. . ,N epochs do 

4: for i = l,...n do 

5: Choose k uniformly at random from [n]. 

6: C ^C + s p - sign(X. fc - CX. k )X T k - s p diag(/x o ((31 - p)). 

7: end for 

8: Project C onto <3> . 

9: p <- ? + s d (Tv{C) - r) 

10: end for 

11: Let I = {i : Cu = 1} and set W = X/.. 

12: Set F = argmin ZeR / X r ||X - ZW|| 1 



negative items to zero and set any diagonal entry exceeding one to one. In practice, we perform 
a tradeoff. Since the constraint Cij < Cjj is used solely for symmetry breaking, we have found 
empirically that we only need to project onto <E>o every n iterations or so. 

This incremental iteration is repeated n times in a phase called an epoch. After each epoch, 
we update the dual variables and quit after we believe we have identified the large elements of the 
diagonal of C. Just as before, once we have identified the hott rows, we can form W by selecting 
these rows of X. We can find F just as before, by solving (2). Note that this minimization can 
also be computed by incremental subgradient descent. The full procedure, called Hottopixx, is 
described in Algorithm 4. 

4.1 Sparsity and Computational Enhancements for Large Scale. 

For small-scale problems, Hottopixx can be implemented in a few lines of Matlab code. But for the 
very large data sets studied in Section 5, we take advantage of natural parallelism and a host of low- 
level optimizations that are also enabled by our formulation. As in any numerical program, memory 
layout and cache behavior can be critical factors for performance. We use standard techniques: 
in- memory clustering to increase prefetching opportunities, padded data structures for better cache 
alignment, and compiler directives to allow the Intel compiler to apply vectorization. 

Note that the incremental gradient step (step 6 in Algorithm 4) only modifies the entries of C 
where X.^ is nonzero. Thus, we can parallelize the algorithm with respect to updating either the 
rows or the columns of C. We store X in large contiguous blocks of memory to encourage hardware 
prefetching. In contrast, we choose a dense representation of our localizing matrix C; this choice 
trades space for runtime performance. 

Each worker thread is assigned a number of rows of C so that all rows fit in the shared L3 
cache. Then, each worker thread repeatedly scans X while marking updates to multiple rows of C. 
We repeat this process until all rows of C are scanned, similar to the classical block-nested loop 
join in relational databases [21]. 
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5 Experiments 



Except for the speedup curves, all of the experiments were run on an identical configuration: a 
dual Xeon X650 (6 cores each) machine with 128GB of RAM. The kernel is Linux 2.6.32-131. 

In small-scale, synthetic experiments, we compared Hottopixx to the AGKM algorithm and 
the linear programming formulation of Algorithm 3 implemented in Matlab. Both AGKM and 
Algorithm 3 were run using CVX [13] coupled to the SDPT3 solver [25]. We ran Hottopixx for 
50 epochs with primal stepsize le-1 and dual stepsize le-2. Once the hott topics were identified, 
we fit F using two cleaning epochs of incremental gradient descent for all three algorithms. 

To generate our instances, we sampled r hott topics uniformly from the unit simplex in W 1 . 
These topics were duplicated d times. We generated the remaining / — r(d + 1) rows to be random 
convex combinations of the hott topics, with the combinations selected uniformly at random. We 
then added noise with (oo, l)-norm error bounded by V'20^i3a- R- ecan that AGKM algorithm is only 
guaranteed to work for rj < 1. We ran with / G {40, 80, 160}, n G {400, 800, 1600}, r G {3, 5, 10}, 
d G {0, 1, 2}, and i] G {0.25, 0.95, 4, 10, 100}. Each experiment was repeated 5 times. 

Because we ran over 2000 experiments with 405 different parameter settings, it is convenient to 
use the performance profiles to compare the performance of the different algorithms [6]. Let V be 
the set of experiments and A denote the set of different algorithms we are comparing. Let Q a (p) 
be the value of some performance metric of the experiment p G V for algorithm a £ A. Then the 
performance profile at r for a particular algorithm is the fraction of the experiments where the 
value of Q a (p) lies within a factor of r of the minimal value of minbg^ Qb(p)- That is, 

p ( , _ #{P £ V ■ Qa(p) < Tmm a/( , A Q a ,(p)} 
a[T) ~ #(V) 

In a performance profile, the higher a curve corresponding to an algorithm, the more often it 
outperforms the other algorithms. This gives a convenient way to contrast algorithms visually. 

Our performance profiles are shown in Figure 2. The first two figures correspond to experiments 
with / = 40 and n = 400. The third figure is for the synthetic experiments with all other values 
of / and n. In terms of (oo, l)-norm error, the linear programming solver typically achieves the 
lowest error. However, using SDPT3, it is prohibitively slow to factor larger matrices. On the other 
hand, Hottopixx achieves better noise performance than the AGKM algorithm in much less time. 
Moreover, the AGKM algorithm must be fed the values of e and a in order to run. Hottopixx 
does not require this information and still achieves about the same error performance. 

We also display a graph for running only four epochs (hott (fast)). This algorithm is by far the 
fastest algorithm, but does not achieve as optimal a noise performance. For very high levels of noise, 
however, it achieves a lower reconstruction error than the AGKM algorithm, whose performance 
degrades once r\ approaches or exceeds 1 (Figure 2(f)). We also provide performance profiles for 
the root-mean-square error of the nonnegative matrix factorizations (Figure 2 (d) and (e)). The 
performance is qualitatively similar to that for the (oo, l)-norm. 

We also coded Hottopixx in C++, using the design principles described in Section 4.1, and ran 
on three large data sets. We generated a large synthetic example (jumbo) as above with r = 100. We 
generated a co-occurrence matrix of people and places from the ClueWeb09 Dataset [2] , normalized 
by TFIDF. We also used Hottopixx to select features from the RCV1 data set to recognize the 
class CCAT [18]. The statistics for these data sets can be found in Table 1. 

In Figure 3 (left), we plot the speed-up over a serial implementation. In contrast to other parallel 
methods that exhibit memory contention [20], we see superlinear speed-ups for up to 20 threads 
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Figure 2: Performance profiles for synthetic data, (a) (oo, l)-norm error for 40 x 400 sized instances and (b) 
all instances, (c) is the performance profile for running time on all instances. RMSE performance profiles 
for the (d) small scale and (e) medium scale experiments, (f) (oo, l)-norm error for the rj > 1. In the noisy 
examples, even 4 epochs of Hottopixx is sufficient to obtain competitive reconstruction error. 



data set 


features 


documents 


nonzeros 


size (GB) 


time (s) 


jumbo 


1600 


64000 


1.02e8 


2.7 


338 


clueweb 


44739 


351849 


1.94e7 


0.27 


478 


RCV1 


47153 


781265 


5.92e7 


1.14 


430 



Table 1: Description of the large data sets. Time is to find 100 hott topics on the 12 core machines. 



due to hardware prefetching and cache effects. All three of our large data sets can be trained in 
minutes, showing that we can scale Hottopixx on both synthetic and real data. Our algorithm 
is able to correctly identify the hott topics on the jumbo set. For clueweb, we plot the RMSE 
Figure 3 (middle). This curve rolls off quickly for the first few hundred topics, demonstrating 
that our algorithm may be useful for dimensionality reduction in Natural Language Processing 
applications. For RCV1, we trained an SVM on the set of features extracted by Hottopixx and 
plot the misclassification error versus the number of topics in Figure 3 (right). With 1500 hott 
topics, we achieve 7% misclassification error as compared to 5.5% with the entire set of features. 

6 Discussion 

This paper provides an algorithmic and theoretical framework for analyzing and deploying any 
factorization problem that can be posed as a linear (or convex) factorization localizing program. 
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Figure 3: (left) The speedup over a serial implementation for Hottopixx on the jumbo and clueweb data 
sets. Note the superlinear speedup for up to 20 threads, (middle) The RMSE for the clueweb data set. 
(right) The test error on RCV1 CCAT class versus the number of hott topics. The horizontal line indicates 
the test error achieved using all of the features. 



Future work should investigate the applicability of Hottopixx to other factorization localizing 
algorithms, such as subspace clustering, and should revisit earlier theoretical bounds on such prior 
art. 
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A Proofs 



Let Y be a nonnegative matrix whose rows sum to one. Assume that Y admits an exact separable 
factorization of rank r. In other words, we can write Y = FW where the rows of W are a-robust 
simplicial and 



UF 



M 



for some permutation II. Let I denote the indices of the rows in Y that correspond with the 
identity matrix in the factorization, which we have called the hott rows. Then we can write each 
row j that is not hott as a convex combination of the hott rows: 

Yj. = MjkY k . for each j £ I. 
fee/ 

As we have discussed, we may assume that J2k Mjk = 1 for each j £ I because each row of Y sums 
to one. 

The first lemma offers a stronger bound on the coefficients Mj k in terms of the distance between 
row j and the hott rows. 

Lemma A.l For an index i, suppose that the row Yg. has distance greater than 5 from a hott topic 
Y t . with i e I. Then M ei < 1 - 5/2. 

Proof We can express the £th row as a convex combination of hott rows: Yg. = X^fce/ -^fe^fe-- For 
each i G I, we can bound as follows. 



S< WY.-Yg. 



Y i .-Y J M lk Y k . 



< 



fee/ 

{l-M u )Y- M tkY k . 
fee/\{i} 

(l-M ei )Y\\ 1+ M tk\\Yk-\\i 
fee/\{i} 

= 2(1 - M ei ) . 

The inequality is the triangle inequality. To reach the last line, we use the fact that each row of 
Y has l\ norm equal to one. Furthermore, J2kei\{i} Mm = 1 — Ma > because each row of M 
consists of nonnegative numbers that sum to one. Rearrange to complete the argument. ■ 

The next lemma is the central tool in our proofs. It tells us that any representation of a hott 
row has to involve rows that are close in l\ norm to a hott row. To state the result, we define for 
each hott row i 

B 5 (i) = {j:\\Y i .-Y j .\\ 1 <6}. 

In other words, Bg(i) contains the indices of all rows with l\ distance no greater than 5 from the 
hott topic Y... 
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Lemma A. 2 Let c G be a nonnegative vector whose entries sum to one. For some hott row 
i G I, suppose that \\c T Y — Y^. ||i < r. Then 



E 



Zj>l 



2t 



min{a<5, a 2 } 



(3) 



Proof Let us introduce notation for the quantity of interest: Wi = Wi{c) = YljeB s (Xi ) c r We ma y 
assume that iOj < 1, or else the result holds trivially. Since the entries of c sum to one, we have 

< 1 - wi = Y c j ~ Y c i = Y c r 

j jeB s (i) j$B s {i) 

Next, we introduce the extra assumption that 5 < a. It is clear that Wi increases monotonically 
with 5, so any lower bound on Wi that we establish in this case extends to a bound that holds for 
larger 5. Since the hott topics are a-robust simplicial, all the other hott topics are at least S away 
from Y~i. in l\ norm. Therefore, the hott row i is the unique hott row listed in Bs(i). 

To establish the result, we may as well assume that Wi(c) achieves its minimum possible value 
subject to the constraints that the value of c Y is fixed and that c is a nonnegative vector that 
sums to one. We claim that this minimum such Wi occurs if and only if Cj = for all j G B${i) \ {i}- 
We complete the proof under this additional surmise. 

The assertion in the last paragraph follows from an argument by contradiction. Suppose that 
Wi(c) were minimized at a vector c where Cj > for some j £ Bs(i) \ {i}- Then we can construct 
another set of coefficients c that satisfies the constraints and leads to a smaller value of W{. We 
have the representation Yj. = ^2 keI Mj^Y^.. Set Cj = 0; set = c& + Cj^jk for each k £ I, and 
set Cfc = Cfc for all remaining k ^ / U {j}. It is easy to verify that c T Y = c T Y and that c is 
a nonnegative vector whose entries sum to one. But the value of W{ is strictly smaller with the 
coefficients c: _^ 

= Yj dk < Y ° k = 

k£B s (i) k£B s (i) 



In this relation, all the summands cancel, except for the one with index j. But Cj 



< Co. It 



follows that the minimum value of W{ cannot occur when Cj > 0. Compactness of the constraint 
set assures us that there is some vector c of coefficients that minimizes Wi(c), so we must conclude 
that the minimizer c satisfies Cj = for j G Bg(i) \ {i}- 

Let us continue. Owing to the assumption that Y^. is no farther than r from c T Y, we have 



r > 



Y°i Y i 



1 - w i)Yi- - Y c i Y r 



(i 



Wi 



Y. - 



Wi 



Y Y c i M i kYk - 



(4) 



3<jLB s (i) kel 

The first line follows when we split the sum over j based on whether or not the components fall in 
Bs{i). Then we apply the property that Cj = for j G B$(i) \ {i}, and we identify the quantity Wi. 
In the last line, we factored out 1 — Wi, and we introduced the separable factorization of Y. 
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Next, for each k & I, define 



1 - Wi 



E c i M ik, 



and note that ilk > 0. Furthermore, 



I> = r=^r E c ^E 

fee/ fee/ 



1 - 10; 



E 9,- = i 



because the rows of M sum to one and because of the definition of Wi. Lemma A.l implies that 7Tj 
satisfies the bound 



n = ^- E c ^\^f E <* = 1-6/2. 



(5) 



Indeed, the lemma is valid because Yj. is at least a distance of 5 away from Y[. for every j ^ .65(2) 
With these observations, we can continue our calculation from (4): 



T > (1 - Wi) 



fee/ 



(1 - Wi)(l - 7Tj) 



> (1 — I0i)(l - 7Tj)a 

> (l-«(<V2)a. 



fce/\{*} 



The first identity follows when we combine the iih term in the sum with The inequality depends 
on the assumption that W is a- robust simplicial; any convex combination of {Yfc. : k ^ /} is at 
least a away from in i\ norm. Afterward, we use the bound (5). Rearrange the final expression 
to complete the argument. ■ 



A.l Proof of Theorem 3.1 

This result is almost obvious when there are no duplicated rows. Indeed, since the hott topics form 
a simplicial set and the matrix Y admits a separable factorization, the only way we can represent 
all r hott topics exactly is to have Cu = 1 for every hott row i. This exhausts the trace constraint, 
and we see that every other diagonal entry Ckk = for every not hott row k. The only matrices 
that are feasible identify the hott rows on the diagonal. They must represent the remaining rows 
using linear combinations of the hott topics because of the constraints CY = Y and Cij < Cjj. It 
follows that the only feasible matrices are factorization localizing matrices. 

When there are duplicated rows, the analysis is slightly more delicate. By the same argument 
as above, all the weight on the diagonal must be concentrated on hott rows. But the objective 
p T diag(C) ensures that, out of any set of duplicates of a given topic, we always pick the duplicate 
row j where pj is smallest; otherwise, we could reduce the objective further. Therefore, the diagonal 
of C identifies all r distinct hott topics, and we select each one duplicate of each topic. As before, 
the other constraints ensure that the remaining rows are represented with this distinguished choice 
of hott topic exemplars. Therefore, the only minimizers are factorization localizing matrices that 
identify each hott topic exactly once. 
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A.2 Proof of Theorem 3.2 



Let X = Y + A. The matrix X is the observed data, with rows scaled to have unit sum, and the 
perturbation matrix A satisfies || A|| 1 < e. We assume that Y is a nonnegative matrix whose 
rows sum to one, and we posit that it admits a rank-r separable NMF Y = FW where W is 
a-robust simplicial. We write I for the set of rows corresponding to hott topics in Y. 

Suppose that Co is a factorization localizing matrix for the underlying matrix Y. That is, 
CqY = Y and each row of Co sums to one. It follows that 

II^oA-AL >1 <(||Co|L >1 + ||IL > J||A|| w>1 <2c. 

Using our decomposition X = Y + A, we quickly verify that 

||C X-X|| . < \\C Y-Y\\ . + IIC0A-AII ,<2e. 

II u II oo,l — ll u lloo,l II u II oo,l — 

The point here is that a factorization localizing matrix for Y serves as an approximate factorization 
localizing matrix for X . 

Our approach for constructing an approximate factorization of X requires us to minimize a cost 
function t T diag(C) over the constraint set 



2c 



(X) = [C > : \\CX - X\\^ < 2e, Tr(C) = r, C j3 < 1 Vj, dj < C 33 Vt, 3} ■ (6) 



Note that the factorization localizing matrix Co for Y is a member of this set, so the optimization 
problem we solve in Theorem 3.2 is feasible. 

Suppose that C G &2e(X) is arbitrary. Let us check that the row sums of C are not much 
larger than one. To that end, note that 

CI = CXI = XI + (CX - X)l = 1 + (CX - X)l. 

We have twice used the fact that every row of X sums to one. For any row c of the matrix C, this 
formula yields c T l < 1 + 2e since \\CX — X\\ < 2e. As a consequence, 

\\ CA - A L,i < (Halloo,! + l|iL,i)ll A L,i < (i + 2e + !) e = 2e + 2e2 - 

We may conclude that 

||CY-y|| ,<\\CX-X\\ , + IICA-All 1 <4e + 2e 2 . 

II lloo,l — II lloo,l II 1 1 00,1 — 

The margin assumption states that \\Yf. — 1^. || > do for every hott topic i £ I and every row 
I £ I. For any i £ I, Lemma A.2 ensures that any approximate representation c T Y of the ith row 
Y;. with error at most 4e + 2e 2 satisfies 

V- ^ 1 8e + 4e 2 
Ci= 1^ c i - 1 2V 

In particular, every matrix C in the set &2e(X) has Cjj > 1 — (8e + 4e 2 )/ min{a<io,a 2 } for each 
hott topic i. To ensure that hott topic i has weight Cu greater than 1 — l/(r + 1) for each i, we 
need 

/ m.m{ado,a 2 } m.m{ado,a 2 } 
£< V 4(r + l) 1K 9(r + l) 
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Since there are r hott rows, they carry total weight greater than r(l — l/(r + 1))- Given the trace 
constraint, that leaves less than 1 — l/(r + 1) for the remaining rows. We see that each of the r 
hott rows must carry more weight than every row that is not hott, so we can easily identify them. 
Once we have identified the set / of hott topics, we simply solve the second linear program 

minimize -X" — 
B 11 

to find a 2e-accurate factorization. 



Xj 



I oo.l 



(7) 



B Projection onto $o 

To project onto the set <E>o> note that we can compute the projection one column at a time. 
Moreover, the projection for each individual column amounts to (after permuting the entries of the 
column) , 

{x G R f : < Xi < xi Vi , x x < 1} . 

Assume, again without loss of generality, that we want to project a vector z with Z2 > z% > . . . > z n . 
Then we need to solve the quadratic program 

minimize — #|| 2 / a \ 

subject to < Xi < x\ Vi , x\ < 1 

The optimal solution can be found as follows. Let k c be the largest k € {2, ...,/} such that 

z kc +i < n [0) i] =: ^ 

where n^^j denotes the projection onto the interval [0, 1]. Set 



Xj 



H i < k, 
(zi)+ i > k, 



Then x is the optimal solution. A linear time algorithm for computing x is given by Algorithm 5 
To prove that x is optimal, define 

) Zi- fi i < k c 
Hi = < 

I min(zj, 0) i > k c 

yi is the gradient of \ \\x — z\\ 2 at x. Consider the LP 

minimize —y T x 

subject to < Xi < x\ \/i , x\ < 1 

x is an optimal solution for this LP because the cost is negative on the negative entries, on the 
nonnegative entries that are larger than k c , positive for 2 < k < k c , and nonpositive for k = 1. 
Hence, by the minimum principle, x is also a solution of (8). 
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Algorithm 5 Column Squishing 

Require: A vector z£l^ with Z2 > z% > . . . > z n . 

Ensure: The projection of z onto {a; G W : < X{ < x\ Vi , x\ < 1}. 

1: [I 4— Z\. 

2: for k = 2, . . . , / do 

3: if Zk < ri[ 0)1 ](/x), Set k c = k — 1 and break 
4: else set /x = + ^ fe . 

5: end for 

6: X X <- n [0il] (/x) 

7: for k = 2, . . . ,k c set = ri[ 01 ] (//). 

8: for k = (k c + 1), . . . , / set x fc = (^)+. 

9: return x 
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